
perf(slirp): architectural experiments — io_uring / splice / multi-queue #83

Draft

dpsoft wants to merge 19 commits into main from slirp-perf-architectural-exp

Conversation


dpsoft commented May 7, 2026

Goal

Stacked on top of #81. With #81's heaptrack-driven user-space alloc reductions exhausted (-90% allocs/iter, p50 unchanged at ~275 µs), the remaining wall-clock floor is dominated by kernel ↔ userspace transitions (per-packet read()/write()), per-vCPU MMIO exits, and single-queue serialization through net_poll_thread.

This branch is the playground for architectural experiments that change the syscall / vCPU shape. Full plan: docs/perf-architectural-experiments.md.

Non-goal: TAP / passt-style host bypass

Dropping SLIRP and routing through TAP + an external passt instance would close the latency gap to passt itself, but it would move the DNS interception, port-forwarding, deny-list, and rate-limiting feature surface out of voidbox into a separate process — and we lose the in-process observability we currently get from instrumenting SLIRP directly. Full SLIRP-path observability is a hard requirement, so passt-style bypass is out of scope.

Experiments (ranked by risk × payoff)

1. io_uring for SLIRP host-socket I/O — start here

Replace per-flow recv() + sendto() (one syscall per packet, serial in net_poll_thread) with batched IORING_OP_RECV / IORING_OP_SEND SQEs submitted in a single syscall after each epoll_wait.

Expected: ~10–30 µs CRR p50 reduction. Risk: lowest — localized to the relay layer's read/write helpers.
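
A minimal sketch of the batched shape, using the io-uring crate. The helper name, fd/buffer plumbing, and user_data scheme here are illustrative assumptions, not the final relay wiring:

    use io_uring::{opcode, types, IoUring};

    /// Hypothetical helper: one recv SQE per ready flow, one
    /// io_uring_enter() for the whole batch.
    fn submit_recv_batch(
        ring: &mut IoUring,
        flows: &mut [(i32, Vec<u8>)], // (host socket fd, recv buffer)
    ) -> std::io::Result<usize> {
        for (corr_id, (fd, buf)) in flows.iter_mut().enumerate() {
            // Safety contract: `buf` must outlive the matching CQE,
            // because the kernel reads/writes it asynchronously.
            let sqe = opcode::Recv::new(types::Fd(*fd), buf.as_mut_ptr(), buf.len() as u32)
                .build()
                .user_data(corr_id as u64); // routes the CQE back to its flow
            unsafe { ring.submission().push(&sqe).expect("submission queue full") };
        }
        ring.submit() // single syscall replaces one recv() per flow
    }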

2. splice() / sendfile() zero-copy on bulk paths

splice() between the host-socket fd and a pipe to eliminate the userspace copy on the bulk-relay TX path. Only works fd-to-fd, so applies to payload bytes only (header rewriting stays in smoltcp).

Expected: +10–20% on tcp_throughput_g2h_mbps. Risk: medium — pipe-fd plumbing through the relay state machine.
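
A rough sketch of the fd-to-pipe-to-fd shape via libc::splice; fd names and error handling are placeholder assumptions:

    /// Hypothetical: move up to `len` payload bytes
    /// host socket -> pipe -> destination fd with no userspace copy.
    unsafe fn splice_relay(src: i32, pipe_rd: i32, pipe_wr: i32,
                           dst: i32, len: usize) -> isize {
        let flags = libc::SPLICE_F_MOVE | libc::SPLICE_F_NONBLOCK;
        let n = libc::splice(src, std::ptr::null_mut(),
                             pipe_wr, std::ptr::null_mut(), len, flags);
        if n <= 0 {
            return n; // 0 = EOF, -1 = check errno (EAGAIN is expected)
        }
        libc::splice(pipe_rd, std::ptr::null_mut(),
                     dst, std::ptr::null_mut(), n as usize, flags)
    }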

3. MSI-X virtio + multi-queue for vCPU scaling

Add MSI-X support to src/vmm/arch/x86_64/ and expose VIRTIO_NET_F_MQ so the guest can spin up per-CPU queue pairs. Host fans out queues to multiple poll threads.

Expected: +50–100% throughput on multi-vCPU sandboxes. Risk: highest — touches IRQ delivery, KVM_IRQFD wiring, and is HW-feature-gated.
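
For orientation, the guest-visible side is roughly the following. The feature bit is from the virtio spec; the config struct is a sketch, not voidbox's actual device code:

    // virtio spec: feature bit 22 advertises multi-queue support.
    const VIRTIO_NET_F_MQ: u64 = 1 << 22;

    // The device config space grows a max_virtqueue_pairs field
    // (little-endian u16 after mac[6] + status); the guest spins up
    // to that many RX/TX queue pairs, typically one per vCPU.
    #[repr(C, packed)]
    struct VirtioNetConfig {
        mac: [u8; 6],
        status: u16,
        max_virtqueue_pairs: u16,
    }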

Tooling

Uses the perf-harness from #81:

  • examples/crr_singleproc_bench — single-process CRR latency (real NAT path).
  • voidbox-network-bench — g2h throughput, RR p50/p99.
  • heaptrack — alloc regression check.
  • tools/perf-harness/bench-pasta.py — pasta reference number.
  • tools/perf-harness/bench-qemu-slirp.sh — qemu+libslirp / qemu+passt cross-check.

Methodology

  1. Each experiment lands as its own commit gated behind a Cargo feature (io-uring, splice-zerocopy, multi-queue) so the #81 baseline (the passt/pasta head-to-head comparison harness) can A/B against it without a revert.
  2. Commit message includes before/after from crr_singleproc_bench --iterations 100 and voidbox-network-bench --iterations 3.
  3. heaptrack after each commit confirms no alloc regression vs round-2 numbers (~41 allocs/iter).
  4. If a commit doesn't move the needle, it's reverted before the next experiment so the diff stays minimal.

Test plan

  • io_uring POC builds + tests pass
  • CRR microbench shows measurable p50 improvement vs the #81 tip
  • No allocation regression vs round-2 numbers
  • (later) splice POC if io_uring wins are smaller than expected
  • (later) MSI-X / multi-queue if single-vCPU floor is hit

dpsoft added 17 commits May 6, 2026 18:30
Two scripts and a doc, deferred deliverable from
docs/superpowers/plans/2026-04-27-smoltcp-passt-port.md
§ "passt head-to-head methodology".

scripts/bench-pasta.py
  Drives the same workload shape as voidbox-network-bench (g2h
  throughput, RR p50/p99, CRR p50) against pasta running in a
  network namespace.  Outputs JSON in the same Report shape so
  bench-compare-pasta.py can diff the two side by side.

  pasta is launched with --config-net + --map-host-loopback
  (default: gateway IP) so connecting to the host gateway from
  inside the netns reaches the host's 127.0.0.1.  Mirrors
  voidbox's SLIRP convention (10.0.2.2 → 127.0.0.1) closely
  enough for the apples-to-apples CRR metric.

scripts/bench-compare-pasta.py
  Reads two JSONs and emits a markdown side-by-side.  Auto-detects
  which file is which via the `backend` field.  Reports the gap
  as 'voidbox N× faster/slower' so the direction is unambiguous.

docs/passt-comparison.md
  Caveats + usage.  Calls out that throughput numbers are NOT
  directly comparable (voidbox has VM/MMIO overhead pasta does
  not).  CRR latency is the apples-to-apples metric: dominated by
  NAT-table operations on both sides.

Tested locally: pasta CRR p50 ≈ 80 µs, voidbox CRR p50 ≈ 10.1 ms
on the same host. The gap is dominated by voidbox's poll-thread
cadence + virtio-mmio exits, not NAT-table cost — an actionable
signal for follow-up perf work.

Pair of artefacts used to root-cause the apparent 122x voidbox-vs-pasta
CRR p50 gap reported by scripts/bench-pasta.py.

tools/crr-client.c
  Static-linked C binary that performs N TCP CRRs in one process,
  no fork or exec per iteration.  Output is one line of nanoseconds:
  N P50 P99 MEAN.  Compile with:

    gcc -O2 -static -o /tmp/crr-client tools/crr-client.c

examples/crr_singleproc_bench.rs
  Voidbox-side driver.  Boots a sandbox with /tmp host-mounted into
  the guest, runs the static binary inside the guest, parses the
  one-line output.  Measures voidbox's NAT-path CRR cost without the
  outer bench's per-iteration nc fork+exec.

Result: voidbox-in-VM at 421 us p50 vs pasta-in-netns at 107 us p50
is dominated (~300 us of the ~314 us gap) by VM transit (virtio-mmio
exits, KVM IRQ injection, vsock RPC), not by SLIRP-engine cost.
A genuinely apples-to-apples SLIRP-vs-SLIRP comparison (passt+qemu
vs voidbox+voidbox-VM) is the natural follow-up; this commit captures
the tooling so that follow-up can stand on a reproducible baseline.

Boots a minimal qemu guest carrying tools/crr-client and runs N TCP
CRRs against a host TCP server.  Two backends:

  --backend libslirp    qemu's built-in -netdev user (libslirp)
  --backend passt       qemu -netdev stream + passt(1) over UNIX socket

Same workload + iteration count as scripts/bench-pasta.py and
examples/crr_singleproc_bench.rs, so the five datapoints (host-direct,
pasta-in-netns, qemu+libslirp, qemu+passt, voidbox+voidbox-SLIRP)
are directly comparable on the same machine.

The script auto-builds the initramfs from tools/qemu-init.sh +
busybox + tools/crr-client, including virtio_net + failover modules
from the host kernel so a stock distro kernel can probe the qemu
virtio-net-pci device.  Voidbox's slim kernel has them built-in and
the insmod calls fail harmlessly.

Result on the dev machine:

  host-direct                63 us p50
  pasta (netns, no VM)      107 us p50
  qemu+libslirp (in VM)     181 us p50
  qemu+passt (in VM)        163 us p50
  voidbox+voidbox-SLIRP     421 us p50

Voidbox is ~2.2x slower than the mature C SLIRPs in the same
VM-attached configuration -- the genuine engine gap, independent of
the fork artefact (10x) and the VM transit (which both sides pay).

Four small wins on the per-packet path between the SlirpBackend's
inject queue and the guest, identified by the SLIRP-vs-SLIRP
comparison (voidbox 421 us p50 vs qemu+passt 163 us p50 on the
single-process TCP CRR benchmark).

src/devices/virtio_net.rs::try_inject_rx
  - Read avail.idx ONCE per call instead of per frame.  The driver
    only bumps it when adding new buffers; per-frame re-reads are
    redundant guest-memory accesses.
  - Replace 'let used_elem = [...].concat()' with a stack [u8; 8]
    (sketched below).  The previous code allocated a Vec<u8> per
    injected frame in the hot path; the new code costs two four-byte
    copies and zero allocs.
  - Write used.idx ONCE at the end of the batch rather than after
    every frame.  The virtio spec only requires a single update per
    publish; per-frame writes were redundant guest-memory accesses.
  - Return frames_injected (usize) so callers can pulse the IRQ
    line conditionally on actual new RX work.
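
The used-ring element is just two little-endian u32s (id, len), so
the stack form is simply (a sketch; the address math and `mem.write`
helper stand in for the real guest-memory API):

    // virtq used element = { le32 id; le32 len } -- 8 bytes, no heap
    let mut used_elem = [0u8; 8];
    used_elem[..4].copy_from_slice(&(desc_head as u32).to_le_bytes());
    used_elem[4..].copy_from_slice(&(written_len as u32).to_le_bytes());
    // ring header (le16 flags + le16 idx) is 4 bytes; entries are 8
    let slot = used_ring_addr + 4 + 8 * u64::from(used_idx % queue_size);
    mem.write(&used_elem, slot);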

src/devices/virtio_net.rs::process_tx_queue
  - Replace per-frame Vec::concat with stack [u8; 8] (same fix as
    the RX path).
  - Read each TX descriptor segment directly into the packet buffer
    via packet.resize() + mem.read(&mut packet[off..]) instead of
    allocating an intermediate Vec<u8> and extend_from_slice'ing.
    Saves one allocation and one full memcpy per descriptor segment.
  - Reuse a single Vec<u8> packet buffer with capacity 1600 across
    all frames in the call instead of allocating fresh per frame.
  - Batch used.idx update at end of the batch (same as RX).

src/vmm/mod.rs::net_poll_thread
  - Track previous-cycle pending state.  Pulse KVM_IRQ_LINE only
    when (a) we actually injected new RX frames this cycle OR (b)
    interrupt_status went from clear -> pending across cycles.
    Previously the loop pulsed twice (assert level=1, then deassert
    level=0) on every cycle while interrupt_status was non-zero,
    even when the guest hadn't acked the previous pulse and no new
    work had arrived.  Skipping the pulse pair when there's nothing
    new saves two ioctl(KVM_IRQ_LINE) calls per redundant cycle
    (~5-10 us each on the CRR hot path).
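
In shape, the gating condition is roughly (names illustrative):

    let pending_now = interrupt_status != 0;
    let newly_pending = pending_now && !was_pending_last_cycle;
    if frames_injected > 0 || newly_pending {
        pulse_irq(); // assert+deassert pair only when there is new work
    }
    was_pending_last_cycle = pending_now;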

Effect on the single-process CRR p50 (mean of 5 runs of 30
iterations each, voidbox+voidbox-SLIRP):

  before: 421 us   p50 mean
  after:  380 us   p50 mean   (~10% improvement)

The IRQ pulse change is the dominant contributor; the RX/TX heap
allocation removals are correct cleanup but contribute below
sample variance.  Voidbox's gap to qemu+passt (163 us) shrinks
from 2.6x to 2.3x; remaining gap candidates are MMIO exit cost,
KVM_IRQ_LINE vs irqfd, and SlirpBackend lock contention.

The voidbox net-poll thread was raising IRQ 10 with two
ioctl(KVM_IRQ_LINE) calls per pulse: assert level=1, then deassert
level=0.  Each ioctl is a syscall (a few µs each on KVM); on the
TCP CRR hot path with multiple IRQ deliveries per connection, the
ioctl pair became a measurable share of per-iteration cost.

Replace with KVM_IRQFD: one eventfd registered with the in-kernel
irqchip via vm_fd().register_irqfd(&eventfd, 10) at thread startup.
Pulsing the IRQ is now a single 8-byte write to the eventfd; the
kernel asserts the IRQ line directly without a userspace round-trip
through ioctl().

The legacy KVM_IRQ_LINE path is kept as a fallback when irqfd
registration fails (kernel without irqfd support, irqchip routing
not initialised).  In normal operation the eventfd succeeds at
startup and the legacy ioctls never run.
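
The registration-plus-pulse shape, per the kvm-ioctls and
vmm-sys-util APIs (error handling elided; GSI 10 from this commit):

    use vmm_sys_util::eventfd::{EventFd, EFD_NONBLOCK};

    // once, at net_poll_thread startup
    let irqfd = EventFd::new(EFD_NONBLOCK).unwrap();
    vm_fd.register_irqfd(&irqfd, 10).unwrap();

    // hot path: a single 8-byte write replaces the
    // assert/deassert ioctl(KVM_IRQ_LINE) pair
    irqfd.write(1).unwrap();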

Effect on the single-process CRR p50 (mean over 5 runs of 30
iterations, voidbox+voidbox-SLIRP):

  before this commit:  ~380 us p50
  after this commit:   ~335 us p50   (~12% reduction)

Cumulative with the previous virtio-net hot-path cleanups:

  baseline:           421 us p50
  after all fixes:    ~335 us p50    (~20% cumulative reduction)

Voidbox's gap to qemu+passt (163 us) shrinks from 2.6x to 2.0x.

Without ioeventfd, every guest TX (write to QUEUE_NOTIFY MMIO with
value=1) forces a KVM_RUN exit: vCPU thread dispatches into virtio-net's
write_mmio handler, calls process_tx_queue, then re-enters KVM_RUN.
On the TCP CRR hot path with multiple TX per connection that's a few
microseconds of pure VM-exit overhead per packet on top of the actual
network work.

Register the eventfd at MMIO addr 0xd000_0050 with datamatch=1 (TX
queue notify only).  Now KVM consumes the matching MMIO write
in-kernel and signals the eventfd; vCPU continues running uninterrupted.
Net-poll thread sees the eventfd alongside flow events on the existing
EpollDispatch (under a token in a tag space that doesn't collide with
PROTO_TAG_*), drains it, and calls process_tx_queue on its own
schedule.

Notifies for queue 0 (RX, value=0) still take the slow path through
the MMIO write handler — they're rare (only when guest adds new RX
buffers) so the optimisation isn't needed there.

Falls back to the synchronous MMIO-exit path if eventfd creation or
KVM_IOEVENTFD registration fails.
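
The registration shape (kvm-ioctls again; the address and datamatch
values are the ones given above):

    use kvm_ioctls::IoEventAddress;
    use vmm_sys_util::eventfd::{EventFd, EFD_NONBLOCK};

    let tx_notify = EventFd::new(EFD_NONBLOCK).unwrap();
    // datamatch = 1: only QUEUE_NOTIFY writes of value 1 (TX queue)
    // are consumed in-kernel; value 0 (RX) still takes the MMIO exit.
    vm_fd
        .register_ioevent(&tx_notify, &IoEventAddress::Mmio(0xd000_0050), 1u32)
        .unwrap();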

Effect on the single-process CRR p50 (mean over 5 runs of 30
iterations, voidbox+voidbox-SLIRP):

  before this commit:    ~335 us p50
  after this commit:     ~278 us p50   (~17% reduction)

Cumulative across the recent perf series:

  baseline:              421 us p50
  + virtio-net cleanups: ~380 us p50
  + KVM_IRQFD:           ~335 us p50
  + KVM_IOEVENTFD:       ~278 us p50   (~34% cumulative)

Voidbox's gap to qemu+passt (163 us) shrinks from 2.6x to 1.7x.

Restructures the host->guest RX path to eliminate the
Arc<Mutex<VirtioNetDevice>> contention between the net-poll thread
and the vCPU thread.  Inspired by the user-suggested Option B:
"net-poll -> rx_queue[vCPU] -> that vCPU consumes".

Before:
  net-poll thread:
    let mut g = net_dev.lock();          // takes device mutex
    g.try_inject_rx(mem);                // descriptor walk + writes
    drop(g);
    pulse_irq();
  vCPU thread on MMIO exit:
    let g = net_dev.lock();              // waits for net-poll
    g.mmio_read(...);

After:
  net-poll thread:
    drain backend frames into a Vec;     // backend mutex only
    push each frame to pending_rx;       // lock-free SegQueue
    pulse_irq();                         // never touches device mutex
  vCPU thread on MMIO exit:
    let mut g = net_dev.lock();          // uncontended now
    g.flush_pending_rx(mem);             // descriptor writes here
    g.mmio_read/mmio_write(...);

Net-poll's hot path no longer holds the VirtioNetDevice mutex at
all -- it only acquires the SLIRP backend Arc independently.  vCPU's
MMIO exits do the descriptor work in-context, paying for it once per
exit but never waiting on a held lock.

Implementation:

  src/devices/virtio_net.rs
    - new field pending_rx: Arc<crossbeam_queue::SegQueue<Vec<u8>>>
    - pending_rx() accessor returns a clone of the Arc
    - slirp_arc() exposes the backend Arc for direct net-poll access
    - new method flush_pending_rx(&mut self, mem) drains the SegQueue
      and writes RX descriptors using the same loop as try_inject_rx
    - try_inject_rx is now a thin wrapper that calls a new shared
      helper write_frames_to_rx_ring; same behaviour, structured
      so flush_pending_rx can share the descriptor-writing logic.

  src/vmm/mod.rs::net_poll_thread
    - Cache pending_rx + slirp Arcs once at thread startup; never
      touch the VirtioNetDevice mutex on the per-cycle path.
    - Drain backend frames into a reusable Vec, wrap each with a
      virtio-net header, push to the SegQueue, then pulse the IRQ.

  src/vmm/cpu.rs (MMIO dispatch)
    - Call guard.flush_pending_rx(guest_memory) at the top of the
      virtio-net MMIO read AND write handlers.  Materialises any
      frames the net-poll thread queued since the last MMIO exit.

Adds: crossbeam-queue = "0.3".

Effect on the single-process CRR p50 (mean over 5 runs of 30
iterations, voidbox+voidbox-SLIRP):

  before this commit:    ~278 us p50
  after this commit:     ~265 us p50   (~5% reduction)

Modest improvement on the single-vCPU benchmark we have available --
the win is mostly architectural (eliminates a contention point that
will become more meaningful with multi-vCPU guests, higher pps, and
parallel TX/RX paths).

Cumulative across the whole perf series:

  baseline:              421 us p50
  + virtio-net cleanups: ~380 us p50
  + KVM_IRQFD:           ~335 us p50
  + KVM_IOEVENTFD:       ~278 us p50
  + Option B SegQueue:   ~265 us p50  (~37% cumulative)

Voidbox's gap to qemu+passt (163 us) is now ~1.6x.

Wraps the device's interrupt_status register in Arc<AtomicU32> so the
net-poll thread can read and update it without taking the device
mutex.  Three concrete benefits:

  1. has_pending_interrupt() is now a single relaxed atomic load on
     &self -- safe to call from any thread, no lock, no contention.
  2. The net-poll thread caches a clone of the Arc at startup and
     uses it directly for its idle-cycle 'do I need to pulse the IRQ?'
     check, removing one mutex acquisition per cycle.
  3. interrupt_status |= 1 (set by RX inject) and interrupt_status &=
     !value (cleared by guest's INTERRUPT_ACK MMIO write) are now
     fetch_or / fetch_and atomic operations -- no read-modify-write
     race between the vCPU thread and the net-poll thread.
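
A sketch of the three access patterns (orderings illustrative; the
`ack_value` name is a placeholder for the guest's ACK write):

    use std::sync::Arc;
    use std::sync::atomic::{AtomicU32, Ordering};

    let interrupt_status = Arc::new(AtomicU32::new(0));

    // RX inject (net-poll thread)
    interrupt_status.fetch_or(1, Ordering::SeqCst);
    // guest INTERRUPT_ACK (vCPU thread)
    interrupt_status.fetch_and(!ack_value, Ordering::SeqCst);
    // idle-cycle check: lock-free, callable from any thread
    let pending = interrupt_status.load(Ordering::Relaxed) != 0;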

The vCPU thread's MMIO read of INTERRUPT_STATUS still goes through
the device mutex via the existing dispatcher, but the underlying
operation is now a pure atomic load -- a follow-up that lets the
dispatcher skip the lock for read-only MMIO accesses gets a cleaner
path because the field no longer needs synchronisation through the
mutex.

Single-vCPU CRR is within sample noise of the previous measurement
(~265 us p50 -> ~289 us across 5 runs of 30 iterations); the win is
mostly architectural rather than measurable on this workload.  Real
benefit shows up with multi-vCPU guests, higher pps, or workloads
where the net-poll and vCPU threads contend more aggressively.

Collects the SLIRP-vs-SLIRP / vs-pasta diagnostic tooling under one
directory.  Five files relocate, no behaviour change:

  scripts/bench-pasta.py          -> tools/perf-harness/bench-pasta.py
  scripts/bench-compare-pasta.py  -> tools/perf-harness/bench-compare-pasta.py
  scripts/bench-qemu-slirp.sh     -> tools/perf-harness/bench-qemu-slirp.sh
  tools/crr-client.c              -> tools/perf-harness/crr-client.c
  tools/qemu-init.sh              -> tools/perf-harness/qemu-init.sh

Updates path references in:
  - bench-qemu-slirp.sh (uses $SCRIPT_DIR for qemu-init.sh location;
    updated busybox extraction to climb two dirs up to repo root)
  - examples/crr_singleproc_bench.rs (doc + error message paths)
  - docs/passt-comparison.md (usage examples + extended example block
    that now also covers bench-qemu-slirp.sh and crr_singleproc_bench)

Smoke-tested after the move:
  - tools/perf-harness/bench-pasta.py --iterations 1 ...   passes
  - tools/perf-harness/bench-qemu-slirp.sh --backend libslirp passes

Eight follow-up fixes from PR #81 review:

src/vmm/mod.rs:
  Extract `setup_tx_notify_ioeventfd` helper and gate the entire
  IOEVENTFD path on `epoll_arc.is_some()`.  Fixes the original safety
  concern: the previous code registered KVM_IOEVENTFD even when no
  epoll dispatcher was available, which would have left guest TX
  notifies trapped in-kernel with no userspace drain — a silent hang.
  The helper rolls back the epoll registration if KVM_IOEVENTFD
  registration fails, so the two halves succeed or fail together.

examples/crr_singleproc_bench.rs:
  Switch the host-side accept thread to non-blocking accept with a
  deadline check so the example never hangs forever if the guest
  fails to connect.  Initial Copilot suggestion of a 2 ms sleep
  inflated each guest CRR sample by ~1.8 ms (sleep latency directly
  added to per-iter accept-pickup time).  Reduced to 50 µs to keep
  the sample noise below the metric resolution.

tools/perf-harness/bench-pasta.py:
  - `detect_host_gateway` now parses the route line by `via` keyword
    instead of indexing parts[2], so non-standard route formats
    don't silently pick up the wrong field.
  - CRR timer started before `srv.accept()` to match the
    voidbox-network-bench `crr_echo_server` semantics.

tools/perf-harness/bench-qemu-slirp.sh:
  - Replace `time.sleep(60)` with `threading.Event().wait()` so the
    host echo server stays alive for the entire qemu run instead of
    timing out at 60 s.
  - Add fail-fast bind error handling so port collisions surface
    immediately instead of producing a confusing "no result" later.

tools/perf-harness/qemu-init.sh:
  Derive the netmask from the CIDR prefix instead of hardcoding
  255.255.255.0, so non-/24 networks work.

tools/perf-harness/bench-compare-pasta.py:
  Remove unused `sign` variable.

docs/passt-comparison.md:
  Update path reference from `scripts/` to `tools/perf-harness/`.

Verified: voidbox single-process CRR p50 stays at ~280-310 µs
(within noise of pre-fix baseline) and `cargo test --test
network_baseline` passes 24/24.

Replace `std::mem::take(&mut *queue)` with an in-place
`extend_from_slice` + `clear()` against a scratch Vec owned by
`SlirpBackend`.  The previous pattern moved the queue's allocation
out and left a fresh `Vec::new()` (cap=0) behind, forcing the next
`push_ready_events` to grow the vector from cap=0 via
`extend_from_slice` every cycle.
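
The shape of the fix, sketched with a placeholder event type (the
real field lives on SlirpBackend as described above):

    use std::sync::Mutex;

    struct SlirpBackend {
        pending_events: Mutex<Vec<u64>>, // event type sketched as u64
        ready_scratch: Vec<u64>,
    }

    impl SlirpBackend {
        fn drain_ready(&mut self) {
            // before: std::mem::take(&mut *q) moved the allocation
            // out, leaving cap=0 for the next push_ready_events
            let mut q = self.pending_events.lock().unwrap();
            self.ready_scratch.clear();
            self.ready_scratch.extend_from_slice(&q);
            q.clear(); // both Vecs keep their capacity across cycles
        }
    }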

Heaptrack on the single-process CRR bench (30 iters) measured
this single callsite as ~half of all allocations during the run:

  before:  push_ready_events  4843 allocs  (49% of total)
           drain_to_guest     4776 allocs  (48% of total)
           total              12618 allocs

  after:   push_ready_events  gone from top callers
           drain_to_guest     3957 allocs  (still hot, downstream)
           total              6885 allocs  (-45%)

p50 CRR latency is unchanged (~270 µs); the wall-clock floor is
elsewhere on this workload.  The win is reduced allocator churn
(less jitter on bulk paths, fewer slow-path mallocs under sustained
load) — visible in the throughput bench rather than the CRR
microbench.

The `pending_events` Mutex<Vec> is also pre-sized to
`EVENTS_PRESIZE = 128` at construction so the very first push
doesn't reallocate.

The SLIRP backend's per-second new-connection rate limit
(`max_connections_per_second`, default 50/s) and concurrent-
connection ceiling (`max_concurrent_connections`, default 64) are
production anti-DoS defaults baked into `LocalSandbox`.  They are
hostile to microbenches that intentionally open hundreds of
connections in a tight loop — at 51 connects/s the limiter starts
returning RST to the guest, which crr-client sees as
`ECONNREFUSED` on its very next connect and exits with rc=3.

Reproduced as the "100-iter failure" in `crr_singleproc_bench`:
30 iters worked, 60 iters did not; the threshold was the 50/s
limit, not anything in the network stack itself.

Surface the two ceilings on `Sandbox::local()` as builder methods:

    .network_max_connections_per_second(u32::MAX)
    .network_max_concurrent_connections(usize::MAX)

`None` keeps the production defaults, so this is purely additive.
The bench now uses both.  500-iter run reproduces clean
(p50 268 µs, p99 1.6 ms, host accepts 500/500).

Both `flush_pending_rx` and `try_inject_rx` previously built a
fresh `Vec<Vec<u8>>` on every MMIO exit and handed it to
`write_frames_to_rx_ring`, which consumed it by value.  The
pattern dropped the outer-Vec allocation and forced the next call
to grow it from cap=0 — heaptrack on the CRR microbench measured
the flush_pending_rx site at 173 calls / 108 MB peak, the largest
remaining alloc consumer after the SLIRP `ready_scratch` fix.

`write_frames_to_rx_ring` now takes `&mut Vec<Vec<u8>>` and drains
in place via `drain(..)` / `append`, so callers reuse a long-lived
scratch buffer:

  - `flush_pending_rx` uses a new `flush_scratch` field on
    `VirtioNetDevice`, populated from `pending_rx` (SegQueue) and
    cleared at end.
  - `try_inject_rx` reuses the existing `rx_scratch` field that
    was already paired with `get_rx_frames`; the trailing
    `mem::take` in `get_rx_frames` is now followed by a
    `clear()` + restore at the end of `try_inject_rx`, so the
    capacity persists across the round-trip.
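
The signature change in sketch form (descriptor writes elided):

    // before: fn write_frames_to_rx_ring(frames: Vec<Vec<u8>>)
    // consumed the outer Vec by value, dropping it every MMIO exit.
    fn write_frames_to_rx_ring(frames: &mut Vec<Vec<u8>>) {
        for frame in frames.drain(..) {
            // ... RX descriptor-chain writes for `frame` ...
            let _ = frame;
        }
        // caller's scratch Vec returns empty but keeps its capacity
    }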

Heaptrack on 100-iter CRR:

  before this commit:  6885 allocs / 30 iters  = 229/iter
  after this commit:  18926 allocs / 100 iters = 189/iter

Aggregate from the original baseline:

  baseline (before all fixes): ~421 allocs/iter
  this commit:                 ~189 allocs/iter   (-55%)

p50 latency unchanged at ~275 µs as expected — alloc reduction
shows up in throughput and tail-latency stability, not the CRR
floor.

`relay_tcp_nat_data` builds a temporary `Vec<Vec<u8>>` per call
because the relay can't push directly to `inject_to_guest` while
iterating `flow_table` (both are `&mut self`).  The previous
pattern allocated a fresh `Vec::new()` every cycle, which
heaptrack flagged as the biggest remaining contributor inside
`drain_to_guest`'s call tree after the prior `ready_scratch`
and `flush_scratch` fixes.

Move the buffer onto `SlirpBackend` as `relay_frames_scratch`
and use the standard `mem::take` → process → restore pattern so
the buffer's capacity persists across `drain_to_guest` calls.
The two trailing `inject_to_guest.append(&mut frames_to_inject)`
sites already preserve capacity (Vec::append leaves the source
empty but with its allocation intact); only the entry-point
`Vec::new()` was discarding work.
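
The pattern in sketch form (`collect_relay_frames` is a hypothetical
stand-in for the borrow-conflicting flow_table iteration):

    // take: the field briefly holds an empty Vec
    let mut frames = std::mem::take(&mut self.relay_frames_scratch);
    // process: fill `frames` while flow_table is mutably borrowed
    collect_relay_frames(&mut self.flow_table, &mut frames);
    // append() empties `frames` but leaves its allocation intact
    self.inject_to_guest.append(&mut frames);
    // restore: capacity survives to the next drain_to_guest cycle
    self.relay_frames_scratch = frames;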

Cumulative impact on the 100-iter CRR microbench:

  baseline (before any of these fixes):  ~421 allocs/iter
  after ready_scratch + flush_scratch:    ~189 allocs/iter
  after relay_frames_scratch (this PR):    ~93 allocs/iter (-78%)

p50 latency continues at ~275 µs; the floor is dominated by
KVM-exit / wakeup costs, not allocator churn.  The win shows up
under sustained load where reduced allocator pressure improves
tail-latency stability and per-frame jitter.

Three of the relay functions called from `drain_to_guest`
(`relay_tcp_nat_data`, `relay_icmp_echo`, `relay_udp_flows`)
each built a per-call `Vec<FlowKey>` to side-step the
`&mut self` / `flow_table` borrow conflict.  The Vecs were
allocated, populated, drained, and dropped on every cycle.
The UDP relay built two — one for the stale-sweep, one for the
readiness loop.

Add a single `flow_keys_scratch: Vec<FlowKey>` field on
`SlirpBackend` and rotate it through all four sites with the
mem::take → process → restore pattern (the relays run
sequentially inside `drain_to_guest`, so one buffer suffices).
Each iteration uses `Vec::drain(..)` instead of for-by-value so
capacity is preserved across the consume.

Heaptrack on the 100-iter CRR microbench:

  before this commit:   9296 allocs (~93/iter)
  after this commit:    4103 allocs (~41/iter)
  temporary allocs:     5546 → 574  (-90%)

Cumulative from the original baseline (start of this round):

  ~421 allocs/iter → ~41 allocs/iter   (-90%)

p50 latency unchanged at ~275 µs as predicted; the wall-clock
floor is dominated by KVM exits / vCPU wakeups.  The gain shows
up as reduced allocator pressure on bulk paths and fewer
slow-path mallocs under sustained load.

Top remaining alloc callsites are now per-frame `Vec<u8>` from
`build_tcp_packet_static` (one allocation per TCP frame) and
TX queue frame parsing — both intrinsic to the protocol shape;
further reduction needs a pool/arena, not a scratch hoist.

Same fix as `crr_singleproc_bench`: the bench's CRR phase opens
30 connections in <1s, which trips the production SLIRP rate
limiter (50 conn/s) and surfaces as a 2 s "crr echo channel
receive error" instead of a real number.

Use the new `Sandbox::local()` rate-limit knobs to lift both
ceilings (max_connections_per_second + max_concurrent_connections)
explicitly.  Production sandboxes are unaffected — the lift is
opt-in.

Plan doc for the next perf round.  With #81's user-space alloc
reductions exhausted (-90% allocs/iter, p50 unchanged), the
remaining floor is kernel↔userspace transitions, MMIO exits, and
single-queue serialization.

Three experiments in scope, ranked by risk × payoff:

  1. io_uring for SLIRP host-socket I/O  — start here
  2. splice() / sendfile() zero-copy on bulk paths
  3. MSI-X virtio + multi-queue for vCPU scaling

Non-goal: TAP + passt-style host bypass.  Routing through an
external passt would close the latency gap to passt but moves the
DNS interception, port-forwarding, deny-list, and rate-limiting
feature surface out of voidbox — and loses the in-process
observability we currently get from instrumenting SLIRP directly.
Full SLIRP-path observability is a hard requirement.

Each experiment lands as its own commit, gated behind a Cargo
feature so the #81 baseline can A/B against it without a revert.
Measurements use the harness shipped in #81.

Base automatically changed from passt-comparison-harness to main May 7, 2026 00:37

dpsoft added 2 commits May 6, 2026 21:42

First commit on the architectural-experiments branch (#83).
Adds a `UringBatch` wrapper around `io_uring::IoUring` with the
submit / drain shape the SLIRP relay will use to batch host-socket
recv / send into single `io_uring_enter` round-trips.

Key shape:

  - One `UringBatch` is single-owner: the SLIRP `net_poll_thread`
    constructs and drives one.  No locking, no cross-thread
    sharing.
  - SQEs are tagged with `(UringOp, correlation_id)` packed into
    `user_data` so the completion drain routes a CQE back to
    its originating flow without a side table.  Low 32 bits =
    correlation id, top 32 bits = op tag.
  - `submit_recv` / `submit_send` are `unsafe` because the kernel
    references the user buffer asynchronously; the caller's
    safety contract requires `buf` to outlive the matching CQE.
  - The existing `EpollDispatch` keeps owning the readiness
    signal — io_uring replaces only the data-plane syscalls,
    not the wake-up.  Two layers stay separable so the feature
    can be toggled off without touching the relay state machine.
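
The packing in sketch form (`UringOp` variants illustrative):

    #[repr(u32)]
    #[derive(Clone, Copy)]
    enum UringOp { Recv = 0, Send = 1 }

    fn pack_user_data(op: UringOp, correlation_id: u32) -> u64 {
        ((op as u64) << 32) | u64::from(correlation_id)
    }

    fn unpack_user_data(ud: u64) -> (u32, u32) {
        ((ud >> 32) as u32, ud as u32) // (op tag, correlation id)
    }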

Behavior unchanged: nothing wires this in yet.  Cargo feature
`io-uring` (off by default) gates both the new module and the
`io-uring = "0.7"` dependency.  Module is `#![allow(dead_code)]`
for now; the next commit on this branch wires the relay TCP
recv / send paths through it and removes the allow.

Tests:

  - 4 unit tests in `src/network/uring.rs` cover user-data round
    trip + a real `submit_send` -> `submit_recv` cycle across a
    `socketpair` (skipped on kernels without io_uring).
  - `cargo test --features io-uring --lib`:  381 passed.
  - `cargo test --test network_baseline` (default features): 24/24.
  - `cargo clippy --all-targets [-- -D warnings]` clean both with
    and without the feature.

Methodology per `docs/perf-architectural-experiments.md`:
each experiment lands as one feature-gated commit so the #81
baseline can A/B against it without a revert.  This is the
infrastructure commit; the next one wires + measures.

Companion to `crr_singleproc_bench`: drives M concurrent
crr-client processes in the same guest so the SLIRP relay sees
N>1 ready flows per `net_poll_thread` cycle.  The single-flow
microbench can't see io_uring batching or multi-queue wins
because there's nothing to batch / parallelize with one ready
flow at a time; this bench is the workload the architectural
experiments on this branch (#83) need.

Per-flow `crr-client` writes its summary line to its own
`/tmp/crr_results/$i.txt`; the trailing shell loop concatenates
all M lines for the host to parse.  Aggregation reports
median-of-p50s, max p99, mean-of-means, and aggregate qps.

Note: busybox-static lacks `seq`, so the flow-id list is
materialized on the host and inlined into the shell command.

## Baseline (this branch's tip = #81 + io_uring scaffold)

Single net_poll_thread, no architectural changes wired:

| M | Median p50 | Max p99 | Aggregate qps |
|---|-----------:|--------:|--------------:|
| 1 |     275 µs |   ~2 ms |        ~3636  |
| 2 |     473 µs | 12.9 ms |         2173  |
| 4 |     732 µs | 13.2 ms |         2370  |
| 8 |    2043 µs | 14.5 ms |         2242  |

Reading:
  - Aggregate qps saturates at ~2200-2400 regardless of M —
    the single net_poll_thread is the bottleneck.
  - Per-flow p50 grows ~linearly with M (M=8 each flow takes
    7.4× the M=1 p50).
  - p99 jumps to 12-14 ms at M=2 already; tail-latency is
    dominated by per-flow head-of-line blocking through the
    single epoll loop.

This is exactly the workload io_uring batching, splice, and
multi-queue should move.  The io_uring wiring lands in the
next commit on this branch with measurements against this
table.